Randomization Inference

Edward Vytlacil

Review: Hypothesis Testing

  • Review: Hypothesis Testing

  • Randomization Inference

  • Applying Randomization Inference to PROGRESA

  • Multiple hypothesis testing

Review: Hypothesis Testing

Definition: Null and Alternative Hypotheses

  • Null Hypothesis \(\mathbb{H}_0\),

    • Hypothesis we maintain (fail to reject) unless there is sufficiently strong evidence against it.
  • Alternative Hypothesis \(\mathbb{H}_1\)

    • Hypothesis we accept if we reject the null.

Review: Hypothesis Testing

Definition: Test Statistic

A test statistic \(T_n\) is a function of our sample that summarizes evidence against the null, with larger values indicating stronger evidence against the null.

Definition: p-value

The p-value is the probability, under the null, of observing a test statistic as large as or larger than the one observed in the sample.

Review: Hypothesis Testing

Definition: Hypothesis Test

A hypothesis test is a decision rule that specifies for which values of the test statistic \(T_n\) we reject or fail to reject the null hypothesis.

Review: Hypothesis Testing

Definition: Type I & II Errors

  • Type I Error: Reject null when null is true

  • Type II Error: Fail to reject null when null is false

Review: Hypothesis Testing

Definition: Significance Level of Test

A test has significance level \(\alpha\) if the probability of a Type I error (rejecting the null when it is true) is less than or equal to \(\alpha\).

Theorem:

Suppose \(p_n\) is a p-value for the null hypothesis \(\mathbb{H}_0\). Then the test that rejects the null whenever \(p_n \le \alpha\) has significance level \(\alpha\).
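As an illustration of this theorem (not from the slides): simulate p-values from a test of a true null and check that the rejection rate at \(\alpha = 0.05\) is close to \(0.05\). A minimal sketch in R, using a one-sample t-test of a true zero mean:

```r
set.seed(1)  # for reproducibility

# 10,000 one-sample t-tests of H0: mean = 0, where H0 is in fact true
pvals <- replicate(10000, t.test(rnorm(30))$p.value)

# fraction of p-values at or below alpha = 0.05:
# should be close to the nominal significance level
mean(pvals <= 0.05)
```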

Randomization Inference

  • Review: Hypothesis Testing

  • Randomization Inference

  • Applying Randomization Inference to PROGRESA

  • Multiple hypothesis testing

Consider Hypothetical RCT

  • We observe \((X_i, Y_i)\) for each individual.

In this example,

  • \(\bar{Y}_1 = \frac{1+1+1}{3}=1\).
  • \(\bar{Y}_0 = \frac{1+0}{2}=\frac{1}{2}\).
\(i\) \(X_i\) \(Y_i\)
1 1 1
2 1 1
3 0 0
4 1 1
5 0 1


\(\bar{Y}_1 -\bar{Y}_0 = \frac{1}{2}\), but is this difference just due to random chance?
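The difference in means for this example is a one-liner in R, using the data from the table above:

```r
y <- c(1, 1, 0, 1, 1)  # observed outcomes Y_i
x <- c(1, 1, 0, 1, 0)  # treatment indicators X_i

# difference in means between treated and control: 1 - 1/2 = 0.5
mean(y[x == 1]) - mean(y[x == 0])
```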

Inference

For any inference procedure:

  • Formally state null and alternative hypotheses,

  • Choose a test statistic to measure evidence against the null hypothesis,

  • Calculate distribution of test statistic under the null to construct p-values, perform hypothesis test.

Null: No Treatment Effect on Average

Definition: Weak Null

  • \(\mathbb{H}_0: \mathbb{E}[ Y_{1,i} - Y_{0,i}] =0\), vs

  • \(\mathbb{H}_1: \mathbb{E}[ Y_{1,i} - Y_{0,i}] \ne 0.\)

  • Null of no effect on average,
    versus alternative of some effect on average.

  • Commonly tested with t-test, based on asymptotic normality.

Null: No Treatment Effect on Anyone

Definition: Strong Null

  • \(\mathbb{H}_0: Y_{1,i} - Y_{0,i}=0\) for all \(i\), vs.

  • \(\mathbb{H}_1: Y_{1,i} - Y_{0,i} \ne 0\) for at least one \(i\).

  • Null of no effect on any subject,
    versus alternative of some effect on someone.

  • Randomization inference tests the strong null: we can find the distribution of the test statistic under the sharp null by exploiting the known randomization procedure.

Consider Hypothetical RCT

  • We observe \((X_i, Y_i)\) for each individual.
  • What do we know about \(Y_{0,i}\), \(Y_{1,i}\) for each individual?
\(i\) \(Y_{0,i}\) \(Y_{1,i}\) \(X_i\) \(Y_i\)
1 ? ? 1 1
2 ? ? 1 1
3 ? ? 0 0
4 ? ? 1 1
5 ? ? 0 1

Consider Hypothetical RCT

  • What do we know about \(Y_{0,i}\), \(Y_{1,i}\) for each individual?
  • If \(X_i=1\), then

    • \(Y_{1,i}=Y_{i}\)

    • We do not know \(Y_{0,i}\)

  • If \(X_i=0\), then

    • \(Y_{0,i}=Y_{i}\)

    • We do not know \(Y_{1,i}\)

\(i\) \(Y_{0,i}\) \(Y_{1,i}\) \(X_i\) \(Y_i\)
1 ? 1 1 1
2 ? 1 1 1
3 0 ? 0 0
4 ? 1 1 1
5 1 ? 0 1


However, under the sharp null, \(Y_{1,i}=Y_{0,i}\) for all \(i\), . . .

Consider Hypothetical RCT

  • Under sharp null:

    • \(Y_i = Y_{1,i}=Y_{0,i}\) for all \(i\).

    • Under sharp null hypothesis, we can fill in missing elements of the table.

\(i~~~~\) \(Y_{0,i}\) \(Y_{1,i}\) \(X_i\) \(Y_i\)
1 (1) 1 1 1
2 (1) 1 1 1
3 0 (0) 0 0
4 (1) 1 1 1
5 1 (1) 0 1

Randomization inference: exploit the randomization procedure, and the fact that we know both \(Y_{0,i}\) and \(Y_{1,i}\) under the null, to find the distribution of the test statistic under the null, and thus to compute p-values and test the null.

Randomization Inference:

  • Sharp null,
    • \(\mathbb{H}_0: Y_{1,i} - Y_{0,i}=0\) for all \(i\), vs.
    • \(\mathbb{H}_1: Y_{1,i} - Y_{0,i} \ne 0\) for some \(i\).
  • Choose a test statistic,
    • e.g., \(T_N = | \bar{Y}_1 - \bar{Y}_0 |\).
  • Calculate distribution of test statistic under the null to construct p-values, perform test.

Distribution of \(T_N\) Under Sharp Null

  • There are \({5 \choose 3}=10\) ways to assign treatment to \(3\) out of \(5\) subjects.

  • Under sharp null, we can calculate value of test statistic for each way to assign treatment.

  • Suppose randomization procedure puts equal probability on each possible way to assign treatment. Then we can find distribution of test statistic under sharp null.

Distribution of \(T_N\) Under Sharp Null

\(i\) \(Y_i\) \(X_1\) \(X_2\) \(X_3\) \(X_4\) \(X_5\) \(X_6\) \(X_7\) \(X_8\) \(X_9\) \(X_{10}\)
1 1 1 1 1 1 1 1 0 0 0 0
2 1 1 1 1 0 0 0 1 1 1 0
3 0 1 0 0 1 1 0 1 1 0 1
4 1 0 1 0 1 0 1 1 0 1 1
5 1 0 0 1 0 1 1 0 1 1 1
\(T_N=\) 1/3 1/2 1/2 1/3 1/3 1/2 1/3 1/3 1/2 1/3

Under the sharp null, we know the value of \(T_N\) for each possible treatment assignment. Because the randomization procedure makes each possible treatment assignment equally likely, we thus know the distribution of \(T_N\) under the sharp null.

Under sharp null:

  • \(\Pr[T_N = 1/3] = 6/10\).

  • \(\Pr[T_N = 1/2] = 4/10\).

  • If \(T_N=1/3\), p-value\(=1\).

  • If \(T_N=1/2\), p-value\(=0.4\).
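The full enumeration above can be reproduced in a few lines of base R (a sketch for this five-subject example):

```r
y     <- c(1, 1, 0, 1, 1)  # observed outcomes (equal to Y_0 = Y_1 under the sharp null)
x_obs <- c(1, 1, 0, 1, 0)  # observed treatment assignment

# test statistic: absolute difference in means
t_stat <- function(x) abs(mean(y[x == 1]) - mean(y[x == 0]))

# all choose(5, 3) = 10 ways to assign treatment to 3 of 5 subjects
assignments <- combn(5, 3)
t_null <- apply(assignments, 2, function(idx) t_stat(as.integer(1:5 %in% idx)))

table(t_null) / length(t_null)  # Pr[T = 1/3] = 0.6, Pr[T = 1/2] = 0.4

t_obs   <- t_stat(x_obs)         # = 1/2
p_value <- mean(t_null >= t_obs) # = 0.4
```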

Applying Randomization Inference to PROGRESA

  • Review: Hypothesis Testing

  • Randomization Inference

  • Applying Randomization Inference to PROGRESA

  • Multiple hypothesis testing

Randomization Inference: PROGRESA

Complication:

  • Randomization was done at the village level, not the child level.
    • Modify the procedure to consider possible treatment assignments at the village level.

Randomization Inference: PROGRESA

Complication:

  • Randomization of 308 out of 491 villages.
    • Very large number of ways to assign treatment to 308 out of 491 villages. . .
choose(491,308)
[1] 2.452172e+139

Randomization Inference: PROGRESA

Complication:

  • Not feasible to enumerate all possible \({491 \choose 308}\) ways to assign treatment.
  • Solution: use stochastic approximation
    • Randomly draw \(K\) out of \({491 \choose 308}\) possible treatment assignments.
    • Base randomization inference on those \(K\) possible treatment assignments, justified as \(K \rightarrow \infty\).
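A minimal base-R sketch of this stochastic approximation, using simulated data (the dataset, sample sizes, and effect size here are invented for illustration):

```r
set.seed(123)
n <- 100
x_obs <- sample(rep(c(1, 0), c(60, 40)))  # 60 of 100 units treated
y <- rnorm(n) + 0.3 * x_obs               # outcomes with a small treatment effect

t_stat <- function(x) abs(mean(y[x == 1]) - mean(y[x == 0]))
t_obs <- t_stat(x_obs)

# draw K random re-assignments instead of enumerating all choose(100, 60);
# sample() permutes the assignment vector, preserving the 60/40 split
K <- 10000
t_null <- replicate(K, t_stat(sample(x_obs)))

p_value <- mean(t_null >= t_obs)  # approximate randomization p-value
```

As \(K \rightarrow \infty\), this Monte Carlo p-value converges to the exact randomization p-value from full enumeration.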

Randomization Inference: PROGRESA

  • Randomization inference on PROGRESA Sample
    • “clustering” at village level
    • Using 10,000 random draws from the \({491 \choose 308}\) possible ways to assign treatment to 308 out of 491 villages.

Randomization Inference: PROGRESA

  • Randomization inference on PROGRESA Sample
    • Outcome is school enrollment
    • Test statistic is \(T_N = | \bar{Y}_1 - \bar{Y}_0 |\).

Randomization Inference: PROGRESA

  • Randomization inference on PROGRESA Sample

    • observed value of test statistic: 0.0392395
    • p-value: 0.0006
    • p-value \(< 0.05\), so reject at \(0.05\)-significance level.
                   term   estimate two_tailed_p_value
1 Custom Test Statistic 0.03923947              6e-04

Code:

library(ri2, quietly = TRUE)

# declare the randomization procedure
n.0 <- length(unique(dfPost$sooloca))                       # number of villages
m.0 <- length(unique(subset(dfPost, treat == 1)$sooloca))   # number of treated villages
declaration_ra <- declare_ra(N = n.0, m = m.0,
                             clusters = dfPost$sooloca)
declaration_ra

# perform randomization inference
ri_out <- conduct_ri(
    test_function = function(d) {
        abs(mean(d[d$treat == 1, ]$school)
            - mean(d[d$treat == 0, ]$school))
    },
    declaration = declaration_ra,
    assignment = "treat",
    outcome = "school",
    data = dfPost,
    sims = 10000
)

# show results
plot(ri_out)
ri_out
  • Using package ri2 to perform randomization inference
    (ri2 documentation).

  • sooloca is village id.
  • n.0 is no. of villages.
  • m.0 is no. of treated villages.
  • using declare_ra from package ri2 to declare that randomization was at village level, with each of the ways to assign treatment to m.0 out of n.0 villages equally likely.

Output of declaration_ra:

Random assignment procedure: Cluster random assignment 
Number of units: 14996 
Number of clusters: 491 
Number of treatment arms: 2 
The possible treatment categories are 0 and 1. 
The number of possible random assignments is approximately infinite. 
The probabilities of assignment are constant across units: 
   prob_0    prob_1 
0.3727088 0.6272912 

  • Use function conduct_ri from package ri2 to perform randomization inference.

  • test_function defines the test statistic, here \(T_N = |\bar{Y}_1-\bar{Y}_0|\).

  • declaration specifies the randomization procedure, defined above with the declare_ra function.

  • assignment names the treatment variable.

  • outcome names the outcome variable.

  • sims is the number of random draws of possible treatment assignments.


Randomization Inference: Subgroups

  • For any subgroup defined at baseline, we can test sharp null of no effect within that subgroup.
    • For example, for boys and girls separately
    • For example, by highest grade completed at baseline.

Randomization Inference: Boys

  • Randomization inference on Sample of Boys

  • p-value: 0.0124

  • p-value \(< 0.05\), so reject at \(0.05\)-significance level.

Randomization Inference: Girls

  • Randomization inference on Sample of Girls

  • p-value: 0.0012

  • p-value \(< 0.05\), so reject at \(0.05\)-significance level.

Randomization Inference by Grade

p-values by sex and grade
Grade Boys Girls
1 0.349 0.609
2 0.006 0.926
3 0.702 0.337
4 0.0003 0.022
5 0.294 0.077
6 0.002 0.008
7 0.011 0.0003
8 0.564 0.634
9 0.285 0.681
10 0.264 0.667

Randomization inference conditional on sex and grade of child:

  • Showing p-values for sharp null of no effect.

  • How would you summarize the evidence of whether PROGRESA has an effect, and on whom?

  • Possible concern: data mining.

    • Correct for multiple hypothesis testing using, e.g., Holm’s Procedure?

Multiple hypothesis testing

  • Review: Hypothesis Testing

  • Randomization Inference

  • Applying Randomization Inference to PROGRESA

  • Multiple hypothesis testing

Multiple Hypothesis Testing

  • Suppose our goal is to determine for which (if any) of the grades the treatment has an effect, i.e., we want to test the sharp null for each of the 10 grades.

    • Consider null hypothesis and alternative for each \(j=1,...,10\),
      • \(\mathbb{H}_{0,j}\): sharp null holds for grade \(j\),
        vs
      • \(\mathbb{H}_{1,j}\) : sharp null does not hold for grade \(j\).

  • We are thus testing a family of \(J\) null hypotheses,
    \(\mathbb{H}_{0,1},\ldots,\mathbb{H}_{0,J}\). Testing a family of \(J \ge 2\) null hypotheses is called a multiple hypothesis testing problem.

FWER

Definition: FWER

  • The family-wise error rate (FWER) is the probability of one or more false rejections among the family of hypotheses.
  • In our PROGRESA example, the FWER is the probability of falsely concluding that the treatment has an effect for at least one grade for which it actually has no effect.

  • If we test each of \(J \ge 2\) null hypotheses separately, each at the \(\alpha\) significance level, the FWER will often be far higher than \(\alpha\), especially if the number of true null hypotheses is large.
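To see how quickly the FWER inflates: if the \(J\) tests are independent and all nulls are true, the FWER is \(1-(1-\alpha)^J\). A quick check in R:

```r
alpha <- 0.05
J <- 10

# probability of at least one false rejection across J independent
# level-alpha tests of true nulls
fwer <- 1 - (1 - alpha)^J
fwer  # about 0.40, far above the nominal 0.05
```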

Correcting p-values to Control FWER

  • Instead of testing so that the probability of falsely rejecting each individual null is at most \(\alpha\), we can design our testing procedure so that the FWER is at most \(\alpha\).

  • Let \(\widehat p_j\) denote the p-value for testing \(\mathbb{H}_{0,j}\). Multiple hypothesis testing procedures typically use as inputs \((\widehat p_{1},...,\widehat p_{J})\), and “correct” the p-values for the multiplicity of hypotheses.

  • The simplest such procedure is the Bonferroni Procedure.

Bonferroni

Definition: Bonferroni Procedure

Reject those \(\mathbb{H}_{0,j}\) for which \(J \times \widehat{p}_j \le \alpha\).

  • \(J \times \widehat p_j\) (capped at 1) is the p-value corrected for multiple hypothesis testing.

  • Can implement in R using function p.adjust with option method="bonferroni".

    • For example, if p is a vector of p-values, then p.adjust(p, method="bonferroni") returns vector of p-values with Bonferroni correction.
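As an illustration, applying the Bonferroni correction to the boys' uncorrected by-grade p-values from the table above (as printed there, i.e., rounded, so the corrected values can differ slightly from those computed on the exact p-values):

```r
# boys' uncorrected p-values for grades 1-10, as printed in the table above
p <- c(0.349, 0.006, 0.702, 0.0003, 0.294, 0.002, 0.011, 0.564, 0.285, 0.264)

# Bonferroni correction: multiply each p-value by J = 10, capping at 1
p.adjust(p, method = "bonferroni")
# grade 4 gives 0.003; only it survives at the 0.05 level after correction
```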

Bonferroni

Theorem

The Bonferroni Procedure results in a FWER \(\le \alpha\).

  • Advantage:
    • Controls FWER.
  • Disadvantage:
    • Correction makes it far harder to reject null hypotheses, increasing probability of Type II errors for false null hypotheses.

Alternative: Holm Step-Down Method

Definition: Holm’s Procedure

Suppose there are \(J\) hypotheses.

  1. Rank your p-values from smallest to largest. Let \(k\) index the ranks.

  2. Start with \(k=1\).

  3. Compute \(\frac{\alpha}{J+1-k}\). This is your critical value for rank \(k\).

  4. If \(\widehat p_k\) is less than or equal to this critical value, reject hypothesis \(k\). Otherwise, do not reject hypothesis \(k\).

  5. If hypothesis \(k\) was rejected, repeat steps (3) and (4) for \(k+1\). If hypothesis \(k\) was not rejected, the process ends here, and all remaining hypotheses are not rejected.
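A minimal sketch of this step-down rule in base R, checked against p.adjust(method = "holm"); the p-values here are illustrative, and the weak inequality at the critical value matches p.adjust's convention:

```r
# step-down Holm procedure: returns TRUE for each rejected hypothesis,
# in the original order of the p-values
holm_reject <- function(p, alpha = 0.05) {
  J <- length(p)
  ord <- order(p)               # ranks, smallest p-value first
  reject <- logical(J)
  for (k in seq_len(J)) {
    if (p[ord[k]] <= alpha / (J + 1 - k)) {
      reject[ord[k]] <- TRUE    # reject and move to the next rank
    } else {
      break                     # stop; all remaining hypotheses kept
    }
  }
  reject
}

p <- c(0.001, 0.012, 0.04, 0.2)
holm_reject(p)                       # TRUE TRUE FALSE FALSE

# agrees with rejecting when the Holm-adjusted p-value is at most alpha
p.adjust(p, method = "holm") <= 0.05
```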

Alternative: Holm Step-Down Method

  • Can implement in R with function p.adjust using option method="holm".
    • For example, if p is a vector of p-values, then p.adjust(p, method="holm") returns vector of p-values with Holm’s correction.

Alternative: Holm Step-Down Method

Theorem

Holm’s Procedure results in a FWER \(\le \alpha\), and rejects at least as many false null hypotheses as the Bonferroni procedure.

  • Holm Step-Down Method controls the FWER while rejecting at least as many (and possibly more) false null hypotheses as with the Bonferroni correction.

PROGRESA, P-Values: Testing Sharp Null

Boys
Grade Uncorrected Bonferroni Holm’s
1 0.349 1 1
2 0.006 0.062 0.050
3 0.702 1 1
4 0.0003 0.003 0.003
5 0.294 1 1
6 0.002 0.023 0.021
7 0.011 0.109 0.076
8 0.564 1 1
9 0.285 1 1
10 0.264 1 1
Girls
Grade Uncorrected Bonferroni Holm’s
1 0.609 1 1
2 0.926 1 1
3 0.337 1 1
4 0.022 0.225 0.180
5 0.077 0.768 0.538
6 0.008 0.077 0.069
7 0.0003 0.003 0.003
8 0.634 1 1
9 0.681 1 1
10 0.667 1 1